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Abstract 

This paper presents a GPU-accelerated implementation of two-dimensional Smart Laplacian smooth¬ 
ing. This implementation is developed under the guideline of our paradigm for accelerating Laplacian- 
based mesh smoothing fl3ll . Two types of commonly used data layouts, Array-of-Structures (AoS) and 
Structure-of-Arrays (SoA) are used to represent triangular meshes in our implementation. Two iteration 
forms that have different choices of the swapping of intermediate data are also adopted. Furthermore, 
the feature CUDA Dynamic Parallelism (CDP) is employed to realize the nested parallelization in Smart 
Laplacian smoothing. Experimental results demonstrate that: (1) our implementation can achieve the 
speedups of up to 44x on the GPU GT640; (2) the data layout AoS can always obtain better efficiency 
than the SoA layout; (3) the form that needs to swap intermediate nodal coordinates is always slower 
than the one that does not swap data; (4) the version of our implementation with the use of the feature 
CDP is slightly faster than the version where the CDP is not adopted. 
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1 Introduction 

The generation of computational mesh models plays a key role in numerical simulation. The quality of 
meshes strongly affects the computational efficiency and the accuracy of numerical results. In general, 
mesh models are needed to be optimized to improve the mesh quality after initially creating. There 
are typically two categories of mesh optimization methods: (1) the first one is mesh clear-up / mod¬ 
ification, which changes the topology of meshes to improve the mesh quality fZTl [5] i5l; and (2) the 
other one is the mesh smoothing, which leaves the mesh connectivity unchanged but to relocate the 
position of mesh nodes to enhance the mesh quality |f7l[6l 181. A number of mesh smoothing methods 
have been introduced and widely used in various applications. Among the mesh smoothing methods, 
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the Laplacian-based and the optimization-based methods are the two types of the most frequently used 
mesh smoothing approaches in practice Bl(T7ll. 

The basic idea behind the Laplacian mesh smoothing is quite simple. In each iteration, the new nodal 
position of each smoothed points is directly the geometric center of those of its adjacent / neighboring 
nodes 0 . There are no other complex smoothing operations. Thus, the Laplacian mesh smoothing is 
computationally non-intensive, and frequently used in various applications. 

However, when smoothing large meshes consisting of a large number of nodes and elements, the 
computational cost is still too high. To improve the efficiency, an effective strategy is to perform the 
mesh smoothing in parallel. For example, Mei, et al. fl3l developed a generic paradigm for accelerating 
the Laplacian-based mesh smoothing on the GPU. Dahal and Newman J6] described efficient GPU- 
accelerated implementations of three triangular 2D mesh smoothing algorithms. In addition, D’Amato 
and Venere Q presented an implementation of a non-Laplacian smoothing method on the GPU to 
optimize tetrahedral meshes in parallel. Other recent efforts attempting at parallelizing mesh smoothing 
include those work presented in lfTll2l fT0l . 

In this paper, we presents an efficient GPU-accelerated implementation of two-dimensional Smart 
Laplacian smoothing. The presented implementation is developed under the guideline of our previously 
proposed paradigm for accelerating Laplacian-based mesh smoothing fl3l . Two types of commonly 
used data layouts, Array-of-Structures (AoS) and Structure-of-Arrays (SoA), are also used to represent 
triangular meshes in our implementation. Furthermore, the feature CUDA Dynamic Parallelism (CDP) 
is employed to realize the nested parallelization in Smart Laplacian Smoothing. The feature CDP is an 
extension to the CUDA programming model which enables a CUDA kernel to create and synchronize 
with new kernel(s) directly on the GPU 11 1 41 . We finally carry out several experimental tests to evalu¬ 
ate the performance of our GPU-accelerated implementation by comparing to the corresponding CPU 
implementation. 


2 Background 

2.1 Laplacian Smoothing 

Laplacian smoothing is one of the most commonly used mesh smoothing algorithms mu. The basic idea 
behind Laplacian smoothing is to relocate each node in a mesh to the geometric center of its neighboring 
nodes; see Figure Q] 

Laplacian smoothing is computationally straightforward but does not always produces enhance¬ 
ments in mesh quality. In practical applications, inverted or even invalid elements in concave regions 
are probably created. To deal with the above problem, some variations such as the Weighted Laplacian 
smoothing 0][T9il and Constrained / Smart Laplacian smoothing l4][20| have been designed. 

The basic idea behind Smart Laplacian smoothing is also simple. For a node being smoothed such 
as the one vq presented in Figure Q] the mesh quality of its incident elements (those elements that share 
this node, e.g., the triangles Tj, T 2 , T 3 , T 4 , and X 5 in FigureH} is first evaluated; Then a newly smoothed 
position of this node is calculated according to the Laplacian smoothing operations. The quality of all 
the incident elements is evaluated again using the new nodal position. If the mesh quality is increased, 
then the new nodal position will be accepted; otherwise, the node will not be repositioned. 

2.2 Iteration Forms 

When calculating the smoothed coordinates of vertices according to the smoothing operation, there are 
typically two forms in terms of selecting the coordinates of neighboring nodes fl3l . These two forms 
can be simply illustrated using the following formulations. 


2 


Accelerating Smart Laplacian Smoothing on the GPU .. 


K. Zhao, G. Mei, N.Xu and J. Zhang 



Before relocating After relocating 

Figure 1: An illustration of Laplacian smoothing 
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where N is the number of neighboring nodes to node i and x- 1 1 is the new position for node i in the 
iteration pass (q + 1). 
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where N is the number of neighboring nodes to node i and xf * ' is the new position for node i in the 
iteration pass (q+ 1). N q and N q +1 are numbers of neighboring nodes derived from the iteration passes 
q and (q+l), respectively. Obviously, the Form A is a special case of the Form B where N q +1 = 0. 


2.3 Data Layouts 

In GPU-accelerated applications there are typically two commonly used data layouts, including the 
Array-of-Structures (AoS) and Structure-of-Arrays (SoA); see a simple illustration in Figure[2] Arrang¬ 
ing data in AoS layout leads to coalescing issues as the data are interleaved. Arranging data as the 
SoA layout makes full use of the memory bandwidth even when individual elements of the structure are 
utilized. In practical applications, it is unable to determine which data layout can always achieve better 
efficiency. In general, the selecting of the proper data layout is application-specific. 


struct PT { 
float x; 
float y; 
bool flag; 

In¬ 
struct PT pts[N]; 

(a) AoS 


struct PT { 
float x[N]; 
float y[N]; 
bool flag[N]; 
In¬ 
struct PT pts; 

(b) SoA 


Figure 2: The data layouts: Array-of-Structures (AoS) and Structure-of-Arrays (SoA) 
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3 Our Implementation 

3.1 Overview 

Our implementation of the Smart Laplacian smoothing is mainly divided into four sub-procedures, in¬ 
cluding the initiating procedure, the finding of neighbors, the determining of constraints, and the iterative 
procedure to obtain the smoothed positions. The initiating procedure is to set several values for each 
vertex and calculate the mesh quality for each triangle. The finding of neighbors is to identify the neigh¬ 
boring vertices / nodes and to find the incident elements / triangles for each vertex. The determining of 
constraints is to identify which vertices are needed to be fixed or free. Finally, the iterating procedure is 
to iteratively calculate the positions of smoothed points. 


3.2 Data Structures 


We design two sets of mesh data structures according to the two layouts AoS and SoA, respectively. 
Corresponding implementations are also developed according to these mesh data structures. Due to the 
limit of the paper length, we only list the data structures represented with the SoA layout; see Listing Q] 


struct cuVert_SOA_ST f 

float *x, *v; // Coordinates of a node 


int *nNeig, *neig; 

// 

Number, 

int *nLoca, *loca; 

// 

Number, 

bool *bBoundary; 

// 

Whether 

float *minQuality; 

\ . 

// 

Qalities 

; r 

struct cuTrgl_SOA_ST 

f 


int *vIDs; 

// 

Indices 

float *Quality; 

// 

Qualitie 


and indices of neighboring nodes 
and indices of incident elements 
nodes are boundary vertices 
of the worst incident elements 


of vertices of triangles 
s of triangles 


}; 

struct cuMesh_SOA_ST { 
int nVert, nTrgl; // 
cuVert_SOA_ST verts; 


Number of vertices and triangles 
cuTrgl_SOA_ST trgls; 


Listing 1: Mesh data structures represented with the SoA layout 


3.3 Implementation Details 

3.3.1 The Initiating Procedure 

This procedure is to (1) calculate the mesh quality metric of each triangle for the first time, and (2) 
initiate several values to prepare for the subsequent calculations. We adopt the mesh quality metric, a, 
proposed by Lee and Lo fl2l to measure the quality of triangular elements. 

We design a simple CUDA kernel to calculate the qualities of all triangular elements, where each 
thread is responsible for computing the quality of a single triangle. The element quality is stored in the 
component float *Quality in the structure cuTrgl_SOA_ST. 

We also design another simple kernel to initiate the values of the components nNeig, nLoca, and 
bBoundary as 0, 0, and false, respectively. Each thread within this kernal / grid is responsible for 
initiating the values for only one points. The above set values mean that: currently there are no found 
adjacent points and incident triangles for any vertex. In addition, any vertex of the mesh is initially set 
to be free point that can be relocated in smoothing. 
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3.3.2 The Finding of Neighbors 

The finding of neighbors is to find: (1) the adjacent / neighboring nodes and (2) the incident triangles 
for each vertex. As described in our previous work 03 ). the finding of neighbors can be quite easily 
performed according to the topology of the mesh. More specifically, for a triangle the second and the 
third vertices must be the adjacent points of the first vertex; similarly, the first and the third vertices are 
definitely the adjacent points of the second vertex. And obviously, each triangle is the incident triangle 
of its three vertices. 

As also explained in our previous work tm due to the data dependencies, we allocate only one 
thread block and a single thread within the thread block to perform the finding of neighbors. The only 
one thread takes the complete responsibilities to finding the adjacent points and incident triangles for 
all vertices. In fact, this procedure carried out on the GPU is the same as the corresponding version 
performed on the CPU. 

3.3.3 The Determining of Constraints 

The determining of constraints is to identify which vertices of a mesh can be relocated in smoothing 
and which cannot be. For planar polygonal meshes, the constraints include the boundary vertices and 
other specifically-defined points such as some feature points. In our implementation, we only consider 
the boundary vertices as constraints. 

We also adopt the method introduced in lfl3l to determine the boundary vertices. More specifically, 
by taking advantage of the indices of the adjacent points, it is quite easy to check whether or not a vertex 
is a boundary one. The basic idea behind determining the boundary vertices is straightforward: if all the 
neighbors of a vertex, e.g., the node vo in Figured have been recorded twice, then the vertex is internal; 
otherwise, it is a boundary vertex. 

We develop a specific kernel to carry out this determine of boundary vertices. Each thread within 
the thread grid is invoked to check whether a points is a boundary one, i.e., to check whether all the 
neighbors / adjacent points of a vertex has been recorded twice. If not, then update the corresponding 
flag value bool bBoundary from being false to true. 

3.3.4 The Iterating Procedure 

The final and key procedure is to iteratively calculate the smoothed positions of all vertices. The main 
feature of this procedure is that: only one thread block is allocated because of the data dependencies. 
In Smart Laplacian smoothing, the calculating of the positions of all smoothed points in the (i + 1 ) th 
iteration depends on the positions of all smoothed points in the i th iteration. In other words, there exist 
data dependencies between the (i + l ) f h iteration and the i th iteration. 

Due to the data dependencies existing in different passes of iterations, only one thread block is 
allocated. Each thread within the thread block is invoked to calculate the smoothed positions of (n 
+ BLOCK_SIZE - 1) / BLOCK_SIZE vertices in one iterative step, where n is the number of all 
vertices and BLOCK_SIZE denotes the number of threads within the only one thread block. The bar¬ 
rier of synchronization __syncthreads () is used to guarantee all threads within the only one block 
finishing calculating one pass of all smoothed positions. 

In Smart Laplacian smoothing it is “Smart” to determine whether or not a non-constrained vertex 
should be relocated. This “Smart” determination is typically carried out by comparing the mesh quality 
of the original mesh before relocating the vertex and the mesh quality after the relocating. In other 
words, when using the new nodal coordinates after relocating, if the mesh quality of the local mesh 
(i.e., all the incident triangles) of the vertex being smoothed definitely improves, for example, if the 
min value of the mesh quality metric of all the incident triangles is increased, then this vertex should be 
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repositioned; otherwise, the position of the vertex will be leaved unchanged. Therefore, after obtaining 
the new position of a smoothed vertex, it is needed to temporarily re-evaluate the mesh quality of the 
local mesh, and then compare the mesh qualities. 

After completely obtaining all the smoothed positions in one iteration step, the mesh quality of all 
triangles is needed to be updated using all of the new nodal coordinates. The qualities of all triangles 
updated in i th iteration will be used to “Smart” calculate the new smoothed positions in the (i + 1 ) th 
iteration. We implement the above updating of mesh quality in each iteration step by with or without 
the use of the feature, CUDA Dynamic Parallelism (CDP). 

“GPU-” versions: In this version, the feature CDP is not adopted. Alternatively, in this kernel of 
iterating, each thread within the only thread block is responsible for: ( 1 ) calculating the quality of all 
of its incident triangles, and (2) finding the min value of the qualities of the incident triangles. The min 
quality is then stored in the component float minQuality of the data structure cuVert_SOA_ST. 

“CDP-” versions: In this version, the feature CDP is used. We design a child kernel specifically 
for updating the quality of all triangles by using new nodal coordinates. Within this child kernel, each 
thread is invoked to evaluate the quality of only one triangle. We also develop another child kernel for 
finding the min quality of the incident triangles for all vertices. Each thread in this kernel is responsible 
for finding the min values of the quality of the incident triangles. Note that the above two child kernels 
are only invoked once in the parent kernel of iterating. 


4 Results 

To evaluate the performance of our GPU implementations, we perform the experimental tests on the 
GeForce GT640 (GDDR5) graphics cards with CUDA 6.5. The CPU experiments are performed on 
Windows 7 SP1 with a dual Intel i5 3.2 GHz processor and 8 GB of RAM memory.We perform the 
experimental tests for both the GPU- and CDP- versions of our implementation. Each version of our 
implementation is tested in two cases where the iteration Form A and Form B are adopted. 

The test data includes 5 planar triangular meshes which are composed of 1, 5, 10, 50, and 100K ver¬ 
tices, respectively. The original unsmoothed triangular meshes are generated according to the standard 
Delaunay triangulation algorithm. First, five sets of uniformly distributed points in 2D are randomly 
generated using the generator provided by Qi, et al. ED; and then Delaunay meshes are created for 
these sets of discrete points using the library Triangle m. 

We evaluate our implementation on the single precision. For the GPU-version, the running time and 
corresponding speedups achieved when using the Form A and Form B are listed in Table Q] and Table 
[2] respectively. Similarly, For the CDP-version, the running time and corresponding speedups obtained 
when iterating in the Form A and Form B are presented in Table 0 and Table HI respectively. We find 
that the highest speedup is up to 44.76x; see Table [3] 


5 Discussion 

We have investigated the related work involving the accelerating of mesh smoothing on the GPU, and 
have found that currently there are only several recent efforts lfl3l 171 l 6 ll. Among the above efforts, the 
work presented in JT) focused on smoothing tetrahedral meshes, while in fl3l and (6j the work aimed 
at developing the GPU-accelerated mesh smoothing for planar meshes. 

The GPU-accelerated implementations of 2D Laplacian mesh smoothing were both introduced in 
fl3l and ( 6 ). However, the above implementations are only developed for the original Laplacian smooth¬ 
ing, rather than the Smart Laplacian smoothing. To the best of the authors’ knowledge, the work pre¬ 
sented in this paper is the first attempt at accelerating Smart Laplacian smoothing on the GPU. 
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Table 1: Performance of the GPU-version developed in the iteration Form A 


Size 


Running time (/ms) 

Speedup 

CPU 

GPU-AoS 

GPU-SoA 

GPU-AoS 

GPU-SoA 

IK 

218 

16.0 

18.8 

13.63 

11.60 

5K 

1217 

72.2 

86.0 

16.86 

14.15 

10K 

2012 

117.6 

138.1 

17.11 

14.57 

5 OK 

14383 

530.5 

618.0 

27.11 

23.27 

100K 

33836 

802.8 

932.8 

42.15 

36.27 


Table 2: Performance of the GPU-version developed in the iteration Form B 


Size 


Running time (/ms) 

Speedup 

CPU 

GPU-AoS 

GPU-SoA 

GPU-AoS 

GPU-SoA 

IK 

171 

14.9 

17.7 

11.48 

9.66 

5K 

1139 

52.4 

76.6 

21.74 

14.87 

10K 

1888 

79.3 

115.0 

23.81 

16.42 

5 OK 

11918 

424.5 

513.1 

28.08 

23.23 

100K 

26021 

776.1 

896.3 

33.53 

29.03 


Table 3: Performance of the CDP- version developed in the iteration Form A 


Size 


Running time (/ms) 

Speedup 

CPU 

GPU-AoS 

GPU-SoA 

GPU-AoS 

GPU-SoA 

IK 

218 

14.1 

16.5 

15.46 

13.21 

5K 

1217 

59.2 

66.8 

20.56 

18.22 

10K 

2012 

96.1 

107.7 

20.94 

18.68 

5 OK 

14383 

477.7 

532.4 

30.11 

27.02 

100K 

33836 

755.9 

832.5 

44.76 

40.64 


5.1 Impact of Data Layouts (AoS and SoA) 

We have observed that: the version of our implementation that is developed using the data layout AoS is 
slightly faster than that is developed using the layout SoA. This behavior has previously been observed 
in our previous work Q3- As also explained in our previous work, the better performance obtained by 
the GPU version based upon the AoS data structures is due to the use of the aligned global memory 
accesses. Therefore, in practical applications, we recommend the developers to use the data layout AoS 
to implement the Smart Laplacian smoothing. However, it is should be also noted that: the design of the 
AoS format mesh data structures is much more complex than that of the SoA formant. 
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Table 4: Performance of the CDP-version developed in the iteration Form B 


Size 


Running time (/ms) 

Speedup 

CPU 

GPU-AoS 

GPU-SoA 

GPU-AoS 

GPU-SoA 

IK 

171 

13.5 

15.4 

12.67 

11.10 

5K 

1139 

50.0 

58.1 

22.78 

19.60 

10K 

1888 

74.8 

82.8 

25.24 

22.80 

5 OK 

11918 

404.4 

424.8 

29.47 

28.06 

100K 

26021 

750.2 

801.7 

34.69 

32.46 


5.2 Impact of Iteration Forms 

By comparing the absolute running time of two variations developed using the two iteration forms, i.e., 
the Form A and the Form B, we have found that: the variation with the Form B is a little faster than the 
variation with the Form A in the cases whenever which data layour (AoS or SoA) is adopted or whether 
the feature CDP is used; see Figure 0for a groups of performance comparison. 

The causes to the above results have been also described in our previous work fl3l . The first cause 
is that: the Form A needs more calculations due to swapping intermediate nodal coordinates during 
iterating. The other cause is that: the convergence speed of the Form A is much lower than that of the 
Form B; and thus the Form A needs much more iterations for converging. Therefore, the Form B is 
suggested in practical applications. 



IK 5K 10K 50K 100K 

Size 


(a) AoS 



IK 5K 10K 50K 100K 

Size 


(b) SoA 


Figure 3: Comparison of the running time of two variations implemented in the Form A and Form B 


5.3 Performance of the Use of CUDA Dynamic Parallelism 

We have observed that: the “CDP” version is slightly faster than the “GPU” version. This is perhaps 
leaded by the following causes. In the GPU version, only a thread block is allocated; and after obtaining 
the new positions of interior points, the quality of local mesh for each point is needed to be re-evaluated. 
Within this only one thread block, a single thread takes the responsibilities for re-calculating the quality 
of local meshes for several points, rather than only one point. In this case, the power of massively 
parallel computing is not fully exploited. The second reason is that: when re-evaluating the quality of 
the local mesh for a single point, each of the incident elements is needed to evaluate it quality once. 
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Due to the fact that a triangle has three vertices, it is thus needed to evaluate its quality three times. 
Obviously, there exists redundancies in the calculating of the quality of triangles. 



IK 5K 10K 50K 100K 

Size 

(a) Using the layout So A in Form A 



IK 5K 10K 50K 100K 


Size 

(b) Using the layout AoS in Form A 


Figure 4: Comparison of the speedups of the GPU-version and CDP-version of our implementation 

In the CDP version, we first design a specific kernel to re-calculate the quality of all triangles after 
obtaining the new positions in each iteration. Each thread within this kernel is responsible for computing 
the quality metric of only one triangle. We also develop another kernel to find the worst triangle with 
the lowest quality of the local mesh for each point. For that the quality of each of the incident elements 
(i.e., triangles) have been calculated, it is quite easy to find the worst triangle that has the lowest quality 
among the incident elements. In this kernel, each thread is designed to find the worst incident element 
for only one point. Obviously, in this CDP version, the power of the massively parallel computing in 
the above two child kernels / grids can be efficiently exploited. However, the computing capability of 
the parent kernel is not fully exploited since only one thread (typically the first thread within the thread 
grid) is needed to invoke the above two child kernels. This is probably the reason why the CDP version 
is slightly rather than significantly faster than the GPU version. 


6 Conclusion 

We have presented an efficient GPU-accelerated implementation of the Smart Laplacian smoothing for 
optimizing planar triangular meshes. We have developed our implementation by using two data layouts, 
Array-of-Structures (AoS) and Structure-of-Arrays (SoA), and employing two iteration forms (Form 
A and Form B). The feature CDP is also adopted to realize the nested parallelization to iteratively 
determine the smoothed vertices’ positions. We have evaluated the performance of our implementation 
using five randomly created triangular meshes on the GPU GT640. Experimental results have indicated 
that: our implementation can achieve the speedups of up to 44x. We have also found that: the data 
layout AoS can obtain better efficiency than the SoA layout. It has been demonstrated that: the Form A 
that needs to swap intermediate nodal coordinates is always slower than the Form B that does not swap 
data. We have also observed that: the version of our implementation with the use of the feature CDP is 
slightly faster than the version where the CDP is not adopted. 
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